AITopics | teacher distribution

Collaborating Authors

teacher distribution

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Why Knowledge Distillation Works in Generative Models: AMinimal Working Explanation

Neural Information Processing SystemsJun-15-2026, 21:30:40 GMT

Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented--enabling smaller student models to emulate the performance of much larger teachers--the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage, which is a behavior modulated by a single entropy-controlling parameter.

distillation, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

Delta Knowledge Distillation for Large Language Models

Cao, Yihan, Kang, Yanbin, Xing, Zhengming, Jiang, Ruijie

arXiv.org Artificial IntelligenceSep-19-2025

Knowledge distillation (KD) is a widely adopted approach for compressing large neural networks by transferring knowledge from a large teacher model to a smaller student model. In the context of large language models, token level KD, typically minimizing the KL divergence between student output distribution and teacher output distribution, has shown strong empirical performance. However, prior work assumes student output distribution and teacher output distribution share the same optimal representation space, a premise that may not hold in many cases. To solve this problem, we propose Delta Knowledge Distillation (Delta-KD), a novel extension of token level KD that encourages the student to approximate an optimal representation space by explicitly preserving the distributional shift Delta introduced during the teacher's supervised finetuning (SFT). Empirical results on ROUGE metrics demonstrate that Delta KD substantially improves student performance while preserving more of the teacher's knowledge.

distillation, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2509.14526

Genre: Research Report (0.50)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

Add feedback

A theory of learning with constrained weight-distribution

Neural Information Processing SystemsAug-15-2025, 02:03:42 GMT

The emerging high-quality large structural datasets raise the question of what general functional principles can be gleaned from them. Motivated by this question, we developed a statistical mechanical theory of learning in neural networks that incorporates structural information as constraints.

algorithm, constraint, weight distribution, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.68)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.95)

Add feedback

Distillation Scaling Laws

Busbridge, Dan, Shidani, Amitis, Weers, Floris, Ramapuram, Jason, Littwin, Etai, Webb, Russ

arXiv.org Machine LearningFeb-12-2025

We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.

distillation, large language model, machine learning, (20 more...)

arXiv.org Machine Learning

2502.08606

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
(21 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Technology > Educational Software (0.48)
Education > Assessment & Standards > Student Performance (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

Add feedback

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Shing, Makoto, Misaki, Kou, Bao, Han, Yokoi, Sho, Akiba, Takuya

arXiv.org Artificial IntelligenceFeb-12-2025

Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation.s To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies. Large language models are too large. Causal language models (LMs) are increasingly becoming essential tools across various sectors (Malinka et al., 2023; Wu et al., 2023; Zhang et al., 2023a; He et al., 2024). Scaling data size, model size, and training steps has been the primary approach to improve LM performance (Kaplan et al., 2020; Hoffmann et al., 2022; OpenAI et al., 2024), leading to rapid advancements in both proprietary and open-source LMs (Touvron et al., 2023; Abdin et al., 2024; Yang et al., 2024). This paradox of scale hinders the widespread deployment and use of LMs despite their potential and high demand. Knowledge distillation offers a promising prescription. One promising approach to developing compact yet high-performing models is knowledge distillation (KD) (Hinton et al., 2015).

large language model, machine learning, taid, (18 more...)

arXiv.org Artificial Intelligence

2501.16937

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.68)
Research Report > New Finding (0.48)

Industry: Education > Educational Technology > Educational Software (0.77)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Add feedback

PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

Ranzinger, Mike, Barker, Jon, Heinrich, Greg, Molchanov, Pavlo, Catanzaro, Bryan, Tao, Andrew

arXiv.org Artificial IntelligenceOct-2-2024

Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed "agglomerative models." We build upon this body of work by studying the effect of the teachers' activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique "PHI Standardization" (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.

dimension, matrix, phi-s, (15 more...)

arXiv.org Artificial Intelligence

2410.0168

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > France (0.04)

Genre: Research Report (1.00)

Industry: Education (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective

Liu, Yujian, Zhang, Yang, Jaakkola, Tommi, Chang, Shiyu

arXiv.org Artificial IntelligenceJul-24-2024

This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that a successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, where the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process as a deconfounding process. This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case. Experiments on existing and new datasets show that our approach, without explicitly optimizing for the aforementioned criteria, achieves competitive performance in all of them. Our code is available at https://github.com/UCSB-NLP-Chang/causal_unlearn.git.

information, knowledge, llm, (17 more...)

arXiv.org Artificial Intelligence

2407.16997

Country:

North America > United States > New York (0.04)
North America > United States > Virginia (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
Media > Film (0.70)
Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Hot PATE: Private Aggregation of Distributions for Diverse Task

Cohen, Edith, Lyu, Xin, Nelson, Jelani, Sarlos, Tamas, Stemmer, Uri

arXiv.org Artificial IntelligenceDec-4-2023

The Private Aggregation of Teacher Ensembles (PATE) framework~\cite{PapernotAEGT:ICLR2017} is a versatile approach to privacy-preserving machine learning. In PATE, teacher models are trained on distinct portions of sensitive data, and their predictions are privately aggregated to label new training examples for a student model. Until now, PATE has primarily been explored with classification-like tasks, where each example possesses a ground-truth label, and knowledge is transferred to the student by labeling public examples. Generative AI models, however, excel in open ended \emph{diverse} tasks with multiple valid responses and scenarios that may not align with traditional labeled examples. Furthermore, the knowledge of models is often encapsulated in the response distribution itself and may be transferred from teachers to student in a more fluid way. We propose \emph{hot PATE}, tailored for the diverse setting. In hot PATE, each teacher model produces a response distribution and the aggregation method must preserve both privacy and diversity of responses. We demonstrate, analytically and empirically, that hot PATE achieves privacy-utility tradeoffs that are comparable to, and in diverse settings, significantly surpass, the baseline ``cold'' PATE.

ensemble, frequency, probability, (17 more...)

arXiv.org Artificial Intelligence

2312.02132

Country:

Europe > Austria > Vienna (0.14)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
(8 more...)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)

Add feedback

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Wen, Yuqiao, Li, Zichao, Du, Wenyu, Mou, Lili

arXiv.org Artificial IntelligenceJul-27-2023

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.

distillation, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2307.1519

Country:

North America > Canada > Alberta (0.14)
Asia > India > Maharashtra > Mumbai (0.05)
Asia > India > Tamil Nadu > Chennai (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Knowledge Distillation of Large Language Models

Gu, Yuxian, Dong, Li, Wei, Furu, Huang, Minlie

arXiv.org Artificial IntelligenceJun-14-2023

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge from white-box generative LLMs is still under-explored, which becomes more and more important with the prosperity of LLMs. In this work, we propose MiniLLM that distills smaller language models from generative larger language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that the MiniLLM models generate more precise responses with the higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance. Our method is also scalable for different model families with 120M to 13B parameters. We will release our code and model checkpoints at https://aka.ms/MiniLLM.

arxiv preprint arxiv, ini llm, language model, (13 more...)

arXiv.org Artificial Intelligence

2306.08543

Country:

North America > United States (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.70)

Industry: Education (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback